Are Web Corpora Inferior? The Case of Czech and Slovak
Abstract
Our paper describes an experiment that assesses lexical coverage in web corpora in comparison with traditional corpora for two closely related Slavic languages, from the lexicographers’ perspective. The preliminary results show that web corpora should not be considered “inferior”, but rather “different”.
Similar resources
Czech-Slovak Parallel Corpora for MT between Closely Related Languages
The paper describes suitable sources for creating Czech-Slovak parallel corpora, including our procedure of creating plain text parallel corpora from various data sources. We attempt to address the pros and cons of various types of data sources, especially when they are used in machine translation. Some results of machine translation from Czech to Slovak based on the acquired corpora are also g...
Adaptation of Czech Parsers for Slovak
In this paper we present an adaptation of two Czech syntactic analyzers, Synt and SET, for the Slovak language. We describe the transformation of the Slovak morphological tagset used by the Slovak development corpora skTenTen and r-mak-3.0 to its Czech equivalent expected by the parsers, and the modifications of both parsers that have been performed partially in the lexical analysis and mainly in the formal g...
Slavonic Corpus for Stylometry Research
Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. But under-represented languages lack data sources usable for stylometry research. In this paper, we propose an algorithm to build corpora containing meta-information required for stylometry experiments (author informa...
Comparison of Slovak and Czech speech recognition based on grapheme and phoneme acoustic models
Grapheme-based mono-, cross- and bilingual speech recognition of Czech and Slovak is presented in the paper. The training and testing procedures follow the MASPER initiative that was formed as a part of the COST 278 Action. All experiments were performed using Czech and Slovak SpeechDat-E databases. Grapheme-based models gave equivalent recognition performance compared to phoneme-based models in ...
TmTriangulate: A Tool for Phrase Table Triangulation
This work was supported by grants no. 645452 (QT21) and no. 644402 (HimL) of the EU and SVV 260 104 of the Czech Republic. We used language resources hosted by the LINDAT/CLARIN project LM2010013 of the Ministry of Education, Youth and Sports. Introduction. Under-resourced language pair: scarcity of parallel corpora. SMT problem: no direct data → no SMT training; insufficient data → poor SMT per...
Publication date: 2017